Authors: Chrystal Chan, Na Guo, Lucy Sun

Introduction

The Moneyball project inspired us to look at hidden factors that can be used to improve the performance of a sports player or team. The 2016 figure skating world championships was recently held in Boston at TD garden, which sparked our interest in this particular sport. Audience members would hold their breath at each jump or spin performed by the skaters, and the final results of each skater seem to be very unpredictable until the end of their performance. That led us to wonder, can the results of a sport as artistic as figure skating be predicted using past data? The International Skating Union is an international authoritative organization with public records on previous championship results as well as information on every individual who participated in any one of the worldwide events for figure skating. The ICU website gives us access to reliable data about skaters’ demographics, scores, rankings in the past and so on since 2008. We would like to apply what we have learned in data science to the field of figure skating in order to explore data from these professional competitions and predict the performance of skaters in competitions.

Overview on Loading Data

We used the ISU websites to download our datasets.

There are 3 types of data we used:

  1. Season Best Score - which includes information on each skater’s seasonal best score across all events.

Example: http://www.isuresults.com/isujsstat/sb2015-16/sbtsmto.htm

  1. Event Result - which gives the score breakdown for each skater per competition. Due to the large number of skating competitions each year, we restricted our event result data to the top 4 skating competitions: Four Continents Figure Skating Championships European Figure Skating Championships World Figure Skating Championships Grand Prix of Figure Skating Series

Example: http://www.isuresults.com/results/fc2009/SEG001.HTM

  1. Skater Biography - which includes biography data for each skater

Example: http://www.isuresults.com/bios/isufs00007684.htm

For figure skating, we have 4 categories: m - men l - ladies p - pair d - ice dance

Most of our analysis will focus on single events data for men and ladies. We decided to exclude pairs data because after loading the initial data, we realized that the sample size for pairs data is too small. Also, we would have needed to consider a lot more variables for the analysis of pairs data. Ex. if female skater is shorter than her male partner. However, we did still include doubles data in our descriptive analysis of the datasets. We excluded ice dance because the scoring rule is very different from other categories. Also there was a change in the judging system a few years ago, so we did not have enough historical data for this category.

Overview of Data Wrangling

After scrapping the datasets from the ISU website, we wrangled the raw data to create clean and useful datasets for our analysis. After cleaning the bio data, we kept only the variables that are useful for our analysis, including country, date of birth, height, and start_career. We also added new columns for continent, number of skaters in the country they’re from, and quintiles ranking countries by number of skaters. All other bio data were qualitative and would have been too difficult to run analysis on. After cleaning the event results data, we turned the dataset from a long format where short program and free skate results were in separate rows, to a wide format where the results are in the same row. We also added columns for total score. Then we combined the biography data with the event results data and season best score data. Finally we added a new variable to our datasets indicating the ranking for each skater by quintiles for the entire year in the season best score data, and for each competition for event results data. After the data was loaded and wrangled, we saved all the datasets that are to be used as Final datasets.RData. The code to scrape and wrangle the data is included in the rscript called FINAL data loading and wrangling R code.R. In this document, we will start by uploading the cleaned datasets.

library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(readr)
library(countrycode)
library(knitr)
library(stats)
library(broom)

Loading already scraped and wrangled datasets

load("/Users/NathalieGuo/Downloads/FINAL datasets.RData")

Summary Statistics

For our exploratory analysis, we looked at different variables from our biography dataset and how they relate to Season Best Scores, and individual event scores.

Country

Our data consists of information on skaters from 72 countries in 5 continents.

return(c(length(unique(bio$country)),length(unique(bio$continent))))
## [1] 72  5

We first want to know the distribution of the number of skaters in each country across the years. Based on our graphs, we can see that USA, Canada, and Russia are the top three countries where the greatest number of skaters are from. This makes sense as ice skating is one of the more popular sports in those countries. We ranked countries into quintiles based on number of skaters from that country, and used the country rank variable later in our regression analysis as a predictor for skater scores.

bio %>% filter(category=="m"&num_quintile%in%c("80~100%","60~80%")) %>% 
  ggplot(aes(x=reorder(country,-numofskater),y=numofskater,fill=continent))+
  geom_bar(stat="identity")+xlab("Country")+ylab("number of male single skaters") +
  ggtitle('Number of Male Skaters by Country')

bio %>% filter(category=="l"&num_quintile%in%c("80~100%","60~80%")) %>% 
  ggplot(aes(x=reorder(country,-numofskater),y=numofskater,fill=continent))+
  geom_bar(stat="identity")+xlab("Country")+ylab("number of female single skaters") +
  ggtitle('Number of Female Skaters by Country')

We looked at which countries had the highest number of skaters, as well as which countries had the highest Season Best Scores. If we look at the mean scores throughout the years, we see that Russia has the highest average Season Best Score for ladies. Uzbekistan has highest average Season Best Score in men. China has the highest average Season Best Score for pairs. USA has the highest number of skaters participating in both single and doubles events.

data <- SeasonBestScoreSingle.df %>%
  group_by(program_type,category, country) %>%
  summarise(count = n(), score_mean = mean(score), score_sd = sd(score)) %>%
  mutate(group = paste(category,program_type))  

#country with maximum scores
maxScoreSingle.df <- do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$score_mean),]))
maxScoreSingle.df
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
## 
##   program_type category country count score_mean  score_sd group
##          (chr)    (chr)   (chr) (int)      (dbl)     (dbl) (chr)
## 1           fs        l     RUS    74  111.08568 18.740378  l fs
## 2           sp        l     RUS    74   59.69649  8.580273  l sp
## 3           to        l     RUS   104  161.41683 28.555741  l to
## 4           fs        m     UZB     5  142.53600 10.688486  m fs
## 5           sp        m     UZB     5   72.66000  6.997607  m sp
## 6           to        m     UZB     6  208.48833 19.849003  m to
#country with maximum count
maxCountSingle.df <- do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$count),]))
maxCountSingle.df
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
## 
##   program_type category country count score_mean  score_sd group
##          (chr)    (chr)   (chr) (int)      (dbl)     (dbl) (chr)
## 1           fs        l     USA    97  102.59732 16.742392  l fs
## 2           sp        l     USA    97   55.35495  7.722283  l sp
## 3           to        l     USA   156  152.02635 23.322165  l to
## 4           fs        m     USA    89  132.11112 25.292853  m fs
## 5           sp        m     USA    90   67.44122 12.485054  m sp
## 6           to        m     USA   140  196.78471 34.521255  m to
data <- SeasonBestScoreDouble.df %>%
  group_by(program_type,category, country) %>%
  summarise(count = n(), score_mean = mean(score), score_sd = sd(score)) %>%
  mutate(group = paste(category,program_type)) 
  
#country with maximum scores
maxScoreDouble.df <-  do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$score_mean),]))
maxScoreDouble.df 
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
## 
##   program_type category country count score_mean  score_sd group
##          (chr)    (chr)   (chr) (int)      (dbl)     (dbl) (chr)
## 1           fd        d     AZE     3   86.46667  3.639579  d fd
## 2           sd        d     RUS    70   57.30243  7.674735  d sd
## 3           to        d     AZE     4  146.25750 10.535386  d to
## 4           fs        p     CHN    32  113.06281 21.157853  p fs
## 5           sp        p     CHN    32   60.17625 11.364523  p sp
## 6           to        p     CHN    50  168.59660 33.640941  p to
#country with maximum count
maxCountDouble.df <-   do.call(rbind,lapply(split(data,data$group),function(chunk) chunk[which.max(chunk$count),]))  
maxCountDouble.df
## Source: local data frame [6 x 7]
## Groups: program_type, category [1]
## 
##   program_type category country count score_mean score_sd group
##          (chr)    (chr)   (chr) (int)      (dbl)    (dbl) (chr)
## 1           fd        d     USA    73   82.14068 15.31989  d fd
## 2           sd        d     USA    73   55.30671 10.48632  d sd
## 3           to        d     USA   117  141.22974 27.70363  d to
## 4           fs        p     RUS    68  108.53382 21.01386  p fs
## 5           sp        p     RUS    68   57.58147 10.89465  p sp
## 6           to        p     USA   108  141.72407 25.69334  p to

We compared Season Best Scores for the countries with the highest number of skaters (ie. Japan, USA, Russia, and Canada), to all other countries. The graphs show that there is a significant difference in Season Best Scores between these countries. Suggesting that country of origin may be a significant predictor for a skater’s scores.

plot.df <-
  SeasonBestScoreSingle.df %>%
  filter(program_type =="to") %>%
  mutate(display_country=ifelse(country  %in% c("RUS","USA", "CAN","JPN") , country, "other countries"))%>% 
    group_by(category, display_country,season_start_year) %>%
    summarize(score = mean(score), count = n()
    )

ggplot(plot.df ,aes(x=season_start_year,y=score)) + 
    geom_point(aes(color=display_country,group=display_country, size = count))+ 
    facet_wrap(~category, scales = "free") + 
    xlab("season start year") +ylab("mean score by country")  +
    ggtitle("SBS for Singles by Country")

plot.df <-
  SeasonBestScoreDouble.df %>%
  filter(program_type =="to") %>%
  mutate(display_country=ifelse(country  %in% c("RUS","USA", "CAN","JPN") , country, "other countries"))%>% 
    group_by(category, display_country,season_start_year) %>%
    summarize(score = mean(score), count = n()
    )

ggplot(plot.df ,aes(x=season_start_year,y=score)) + 
    geom_point(aes(color=display_country,group=display_country, size = count))+ 
    facet_wrap(~category, scales = "free") + 
    xlab("season start year") +ylab("mean score by country")  +
    ggtitle("SBS for Doubles by Country")

Based on these results, we determined that country of origin could be a potential variable for predicting a skater’s performance.

Height

Next we were interested in including height as a potential variable for prediction, so looked at the distribution of height by gender. When we used qqnorm and qqline to check the normality of the distributions, we found that the height of female skaters is normally distributed, however, the height of male skaters is a little left-skewed.

bio %>% filter(is.na(height)==FALSE) %>% ggplot(aes(height)) +
  geom_histogram()+facet_grid(.~category)+stat_bin(bins=30) +
  ggtitle('Heights by Gender')

bio %>% filter(is.na(height)==FALSE) %>% 
  ggplot(aes(continent,height,color=continent)) +
  geom_boxplot()+ facet_grid(.~category) +
  ggtitle('Heights by Gender')

Gender/Categories

Next, we looked at the distribution of Season Best Scores and Event Scores by categories: Men Singles, Ladies Singles, Pairs, Ice Dancing.

  #let see if the year distribution is the same
  plot.df <- SeasonBestScoreSingle.df %>%
    filter(program_type == "to") %>%
    mutate(group = paste(category,program_type))  
  
  ggplot(filter(plot.df,category == "m") ,aes(score)) + 
    geom_histogram()+ 
    facet_wrap(~season_start_year, scales = "fixed") + 
    ggtitle("Season Best Score Distribution for every year for men total score")

  qqnorm(filter(plot.df,category == "m"&season_start_year==2015)$score)
  qqline(filter(plot.df,category == "m"&season_start_year==2015)$score)

  ggplot(filter(plot.df,category == "l") ,aes(score)) + 
    geom_histogram()+ 
    facet_wrap(~season_start_year, scales = "fixed") + 
    ggtitle("Season Best Score Distribution for every year for ladies total score") 

  qqnorm(filter(plot.df,category == "l"&season_start_year==2015)$score)
  qqline(filter(plot.df,category == "l"&season_start_year==2015)$score)

plot.df <- SeasonBestScoreDouble.df %>%
  filter(program_type == "to") %>%
  mutate(group = paste(category,program_type))  

ggplot(filter(plot.df,category == "p") ,aes(score)) + 
  geom_histogram()+ 
  facet_wrap(~season_start_year, scales = "fixed") + 
  ggtitle("Season Best Score Distribution for every year for pair program total score")

qqnorm(filter(plot.df,category == "p"&season_start_year==2015)$score)
qqline(filter(plot.df,category == "p"&season_start_year==2015)$score)

ggplot(filter(plot.df,category == "d") ,aes(score)) + 
  geom_histogram()+ 
  facet_wrap(~season_start_year, scales = "fixed") + 
  ggtitle("Season Best Score Distribution for every year for ice dancing total score") 

qqnorm(filter(plot.df,category == "d"&season_start_year==2015)$score)
qqline(filter(plot.df,category == "d"&season_start_year==2015)$score)

For men, we can see that the distribution of Season Best Score is fairly normal across the years. With the average increasing over the years.

For ladies, the SBS distribution in earlier years are more right skewed, but have become more normally distribution since 2012. However from the QQ plot, we can see that tails still do not follow a normal distribution as well the men’s distribution do.

For both pairs and ice dancing, the SBS distribution do not follow normality, as seen on the QQ plot. There are many “gaps” in the distribution. However, the lack of normal distribution could be due to the fact that pairs and ice dance have fewer skaters than men singles and ladies singles.

Looking at average individual event scores over the years by gender, we can see that men consistently score higher than women.

single_gender<-EventResultSingle_wr%>%
  group_by(program_type,category,season_start_year)%>%
  summarize(TSS=mean(TSS,na.rm=TRUE),TES=mean(TES,na.rm=TRUE),PCS=mean(PCS,na.rm=TRUE),
            SS=mean(SS,na.rm=TRUE),TR=mean(TR,na.rm=TRUE),PE=mean(PE,na.rm=TRUE),
            CH=mean(CH,na.rm=TRUE),IN=mean(IN,na.rm=TRUE))

single_gender%>%
  ggplot(aes(x=season_start_year,y=TSS,group=category,color=category))+
  geom_line()+geom_point()+ggtitle('TSS Scores by Gender')+facet_wrap(~program_type)+ 
  scale_x_continuous(breaks=c(seq(2005,2015, by=2)))

These results suggest that we should run separate predictive models for Men and Women.

Score Breakdowns

To understand the score breakdown, each competition includes scores for two events, Short Program and Free Skate. For each event, there’s a total score (TSS) which is a sum of the total elements score (TES), and the program component score (PCS). The program component score includes scores for 5 individual component: skating skills (SS), Transitions (TR), Performance/Execution (PE), Choreography / Composition (CH), and Interpretation (IN). Each component is marked with a value from 0 to 10 in 0.25 increments. These five marks are then multiplied by a factor depending on the type of program and level. For senior ladies and pairs, the factor is 0.8 for the short program and 1.6 for the long program. This means that PCS is more important in the long program. *Note:Since we will not focus on ice dance, we will not look at free dance (fd) and short dance (sd).

Source: http://gofigureskating.com/compete/scoring.html

Season Best Scores

Please note that we only have SBS for “Short Program” and “Free Skating” since 2011.

ggplot(data= filter(SeasonBestScoreSingle.df, season_start_year >= 2011) ,aes(x=category,y=score))+
  geom_boxplot(aes(color=category))  +ggtitle("SBS for Singles") +xlab("score type") +ylab("SBS")  + 
 facet_wrap(~program_type, scales = "fixed")

ggplot(data= filter(SeasonBestScoreDouble.df, season_start_year >= 2011) ,aes(x=program_type,y=score))+
  geom_boxplot(aes(color=program_type))  +ggtitle("SBS for Doubles") +xlab("score type") +ylab("SBS")  + 
 facet_wrap(~category, scales = "fixed")

Overall, we saw that free skating has higher SBS than short program for each category.

#create a combination table
todata <- 
  SeasonBestScoreSingle.df%>%
  filter(program_type == "to") %>%
  mutate(event_to = event, date_to= date, score_to = score) %>%
  select (first_name, last_name, country, season_start_year, category, event_to, date_to, score_to)
  
fsdata <- 
  SeasonBestScoreSingle.df%>%
  filter(program_type == "fs") %>%
  mutate(event_fs = event, date_fs= date, score_fs = score) %>%
  select (first_name, last_name, country, season_start_year,category, event_fs, date_fs, score_fs)  
  
spdata <- 
  SeasonBestScoreSingle.df%>%
  filter(program_type == "sp") %>%
  mutate(event_sp = event, date_sp= date, score_sp = score) %>%
  select (first_name, last_name, country, season_start_year,category, event_sp, date_sp, score_sp)  



alldata <- full_join(todata, fsdata)
alldata <- full_join(alldata,spdata)

data<- alldata %>%
  filter(season_start_year >= 2011) %>%
  filter(!is.na(score_fs)&!is.na(score_sp)) %>%
  mutate(sameEventForFSandSP = ifelse(event_fs==event_sp, TRUE, FALSE)) %>%
  mutate (score_diff = score_fs - score_sp)


plot.df <- 
  data %>%
  group_by(season_start_year, category, sameEventForFSandSP) %>%
  summarise(count= n(), score_to_mean = mean(score_to) )

ggplot(plot.df ,aes(x=season_start_year,y=score_to_mean)) + 
  geom_line(aes(color=sameEventForFSandSP,group=sameEventForFSandSP))+ 
  facet_wrap(~category, scales = "free") + 
  ggtitle("Skaters who have SBS from same event vs. not")  +xlab("season start year") +ylab("mean SBS total score")

We can see that in general skaters who have their Season Best Scores for free skating and short program from different events usually have higher total Season Best Scores.

Event Scores

We looked at Program Component Scores, Total Elements Scores, and Total Scores over the years for each of the four competitions in the data set. Based on the graphs, we can see that GPF consistently has the highest scores, followed by World Championships, European Championships, and Four Continents.

#Scores over the year by event
singles_scores<-EventResultSingle_wr%>%
  group_by(program_type,event_code,season_start_year)%>%
  summarize(TSS=mean(TSS,na.rm=TRUE),TES=mean(TES,na.rm=TRUE),PCS=mean(PCS,na.rm=TRUE),
            SS=mean(SS,na.rm=TRUE),TR=mean(TR,na.rm=TRUE),PE=mean(PE,na.rm=TRUE),
            CH=mean(CH,na.rm=TRUE),IN=mean(IN,na.rm=TRUE))

singles_scores%>%
  ggplot(aes(x=season_start_year,y=TSS,group=event_code,color=event_code))+
  geom_line()+geom_point()+ggtitle('TSS Scores')+facet_wrap(~program_type) + 
  scale_x_continuous(breaks=c(seq(2005,2015, by=2)))

singles_scores%>%
ggplot(aes(x=season_start_year,y=TES,group=event_code,color=event_code))+
  geom_line()+geom_point()+ggtitle('TES Scores')+facet_wrap(~program_type)+ 
  scale_x_continuous(breaks=c(seq(2005,2015, by=2)))

singles_scores%>%
  ggplot(aes(x=season_start_year,y=PCS,group=event_code,color=event_code))+
  geom_line()+geom_point()+ggtitle('PCS Scores')+facet_wrap(~program_type)+ 
  scale_x_continuous(breaks=c(seq(2005,2015, by=2)))

Below is a summary of the average number events per skater, as well as number of participants for each event by gender. We can see that the Grand Prix Final has a lot fewer participants than the other competitions. This is because the Grand Prix Final is a cumulation of the ISU Grand Prix of Figure Skating series, which consists of the Skate America, Skate Canada International, Trophee Eric Bompard, Cup of China, Cup of Russia, and NHK Trophy competitions. Only the top 6 skaters from each discipline go on to compete in the final.

length(unique(EventResultSingle_wr$full_name))
## [1] 686
#686 total number of single's event attendents

#looking at events per skater
Unique_events<-EventResultSingle_wr%>%group_by(full_name)%>%
  summarize(events=length(event_year))
summary(Unique_events)
##   full_name             events     
##  Length:686         Min.   : 1.00  
##  Class :character   1st Qu.: 2.00  
##  Mode  :character   Median : 2.00  
##                     Mean   : 5.02  
##                     3rd Qu.: 6.00  
##                     Max.   :34.00
hist(Unique_events$events, breaks=25, main='Average Events per Skater', xlab='Unique Events',
     ylab='Number of Skaters')

#Summary of Number of skaters for each event
EventResultSingle_wr%>%group_by(season_start_year,event_code,category)%>%
  summarize(count=n())%>%
  ggplot(aes(x=season_start_year,y=count,fill=category))+
  geom_bar(stat='identity',position=position_dodge())+facet_wrap(~event_code)

EventResultSingle_wr%>%group_by(event_code,category)%>%summarize(n())
## Source: local data frame [8 x 3]
## Groups: event_code [?]
## 
##   event_code category   n()
##        (chr)    (chr) (int)
## 1         ec   Ladies   553
## 2         ec      Men   577
## 3         fc   Ladies   547
## 4         fc      Men   517
## 5        gpf   Ladies    95
## 6        gpf      Men    91
## 7         wc   Ladies   478
## 8         wc      Men   586

Correlation Matrix

We ran a correlation matrix of the component scores Free skating against component scores for Short Program. The purpose of this was to find which individual components were more correlated with the total score, and also which components were more correlated with each other. We wanted to see if there are certain components that correlate with the total score more, and recommend that skaters invest more time in improving those components. However, our results showed a high correlation with between all the components, indicating that all components are equally important for skaters.

x<-single_event%>%select(SS_FS:IN_FS,SS_SP:IN_SP)
x<-na.omit(x)

cor_matrix<-data.frame(cor(x))
cov_matrix<-data.frame(cov(x))
round(cor_matrix, digits=2)
##       SS_FS TR_FS PE_FS CH_FS IN_FS SS_SP TR_SP PE_SP CH_SP IN_SP
## SS_FS  1.00  0.99  0.99  0.99  0.99  0.97  0.97  0.97  0.97  0.97
## TR_FS  0.99  1.00  0.99  0.99  0.99  0.96  0.97  0.97  0.97  0.97
## PE_FS  0.99  0.99  1.00  0.99  0.99  0.96  0.96  0.96  0.96  0.96
## CH_FS  0.99  0.99  0.99  1.00  1.00  0.97  0.97  0.97  0.97  0.97
## IN_FS  0.99  0.99  0.99  1.00  1.00  0.96  0.96  0.96  0.96  0.97
## SS_SP  0.97  0.96  0.96  0.97  0.96  1.00  0.99  0.99  0.99  0.99
## TR_SP  0.97  0.97  0.96  0.97  0.96  0.99  1.00  0.99  0.99  0.99
## PE_SP  0.97  0.97  0.96  0.97  0.96  0.99  0.99  1.00  0.99  0.99
## CH_SP  0.97  0.97  0.96  0.97  0.96  0.99  0.99  0.99  1.00  1.00
## IN_SP  0.97  0.97  0.96  0.97  0.97  0.99  0.99  0.99  1.00  1.00
round(cov_matrix, digits=2)
##       SS_FS TR_FS PE_FS CH_FS IN_FS SS_SP TR_SP PE_SP CH_SP IN_SP
## SS_FS  1.65  1.68  1.70  1.71  1.75  1.54  1.55  1.58  1.58  1.63
## TR_FS  1.68  1.74  1.75  1.76  1.81  1.57  1.60  1.62  1.61  1.67
## PE_FS  1.70  1.75  1.79  1.78  1.84  1.58  1.60  1.63  1.62  1.68
## CH_FS  1.71  1.76  1.78  1.80  1.85  1.60  1.62  1.64  1.64  1.70
## IN_FS  1.75  1.81  1.84  1.85  1.91  1.64  1.66  1.69  1.69  1.75
## SS_SP  1.54  1.57  1.58  1.60  1.64  1.52  1.53  1.55  1.55  1.60
## TR_SP  1.55  1.60  1.60  1.62  1.66  1.53  1.56  1.57  1.57  1.62
## PE_SP  1.58  1.62  1.63  1.64  1.69  1.55  1.57  1.60  1.59  1.65
## CH_SP  1.58  1.61  1.62  1.64  1.69  1.55  1.57  1.59  1.60  1.65
## IN_SP  1.63  1.67  1.68  1.70  1.75  1.60  1.62  1.65  1.65  1.71

Linear Regression

In this section, we want to explore the associations between biography variables and each outcome of interest by using linear regression models stratified by sex. Biography variables include current age at event year, height, career length in years and category of country regarding the quintiles of number of sex-specific skaters. Outcomes are event total score, season best total score, season best score of short program and season best score of free skating.

Before we built models, ggplots were used to visualize the associations. Univariate linear regression models were built for each combination of biography variables and outcomes. Since all the associations between single predictor and outcomes are statistically significant, we included them all in the final multivariate linear regression models.

Linear Regression for Season Best Scores

bsb<- seasonbestsingle_bio
bsb %>% filter(is.na(name)==TRUE) %>% summarise(n())
## Source: local data frame [1 x 1]
## 
##     n()
##   (int)
## 1   260

Out of 7295 observations, there are 260 missing biography info.

##filter out observations with missing biography info
bsb <- bsb %>% filter(is.na(name)==FALSE) 

Part 1: season best total score

bsb_to<- bsb %>% filter(program_type=="to") 

bsb_to %>% ggplot(aes(age,score,color=category))+geom_point()+
  geom_smooth(method="lm")+facet_grid(.~category) + ggtitle('Association between Age and Total Score')

bsb_to %>% group_by(category) %>% do(tidy(lm(score~ age,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term  estimate std.error statistic      p.value
##      (chr)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
## 1        l (Intercept) 60.446374 4.7691940 12.674337 3.292914e-35
## 2        l         age  3.407357 0.2762010 12.336511 1.603907e-33
## 3        m (Intercept) 42.016774 6.1057150  6.881549 9.128167e-12
## 4        m         age  6.491712 0.3195556 20.314815 4.893443e-80

For both ladies and men, age has a significantly positive relationship with season best total score.

bsb_to %>% ggplot(aes(height,score,color=category))+ geom_point()+geom_smooth(method="lm")+
  facet_grid(.~category) + ggtitle('Association between Height and Total Score')

bsb_to %>% group_by(category) %>% do(tidy(lm(score~ height,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term    estimate  std.error statistic      p.value
##      (chr)       (chr)       (dbl)      (dbl)     (dbl)        (dbl)
## 1        l (Intercept) 231.3744361 24.3311874  9.509377 6.754287e-21
## 2        l      height  -0.6926333  0.1502604 -4.609552 4.355803e-06
## 3        m (Intercept) 335.5009789 35.9249173  9.338949 4.188183e-20
## 4        m      height  -0.9836573  0.2072098 -4.747156 2.296034e-06

For both ladies and men, height has a significantly negative relationship with season best total score.

bsb_to %>% ggplot(aes(career_length,score,color=category))+ geom_point()+geom_smooth(method="lm")+
  facet_grid(.~category) + ggtitle('Association between Career Length and Total Score')

bsb_to %>% group_by(category) %>% do(tidy(lm(score~ career_length,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category          term  estimate std.error statistic       p.value
##      (chr)         (chr)     (dbl)     (dbl)     (dbl)         (dbl)
## 1        l   (Intercept) 75.478406 2.7697810  27.25068 1.529613e-134
## 2        l career_length  3.717067 0.2242265  16.57729  4.617092e-57
## 3        m   (Intercept) 90.861278 3.6268479  25.05241 4.707949e-113
## 4        m career_length  5.666503 0.2633873  21.51396  6.819334e-88

For both ladies and men, career length has a significantly positive relationship with season best total score.

bsb_to %>% ggplot(aes(num_quintile,score,color=category))+
  geom_boxplot()+facet_grid(category~.) + ggtitle('Association between Country Rank and Total Score')

bsb_to %>% group_by(category) %>% 
  do(tidy(lm(score~ num_quintile,data= .),data= .)) 
## Source: local data frame [10 x 6]
## Groups: category [2]
## 
##    category                term  estimate std.error statistic
##       (chr)               (chr)     (dbl)     (dbl)     (dbl)
## 1         l         (Intercept)  89.52679  1.524551 58.723400
## 2         l  num_quintile20~40%  13.36064  2.111970  6.326147
## 3         l  num_quintile40~60%  18.48902  2.223786  8.314211
## 4         l  num_quintile60~80%  44.78198  2.042556 21.924478
## 5         l num_quintile80~100%  62.71311  2.147680 29.200399
## 6         m         (Intercept) 132.63691  2.540832 52.202162
## 7         m  num_quintile20~40%  11.09490  3.604352  3.078195
## 8         m  num_quintile40~60%  14.65078  3.639055  4.025985
## 9         m  num_quintile60~80%  54.50751  3.292861 16.553237
## 10        m num_quintile80~100%  65.25258  3.627232 17.989637
## Variables not shown: p.value (dbl)

Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater season best total score.

##Multivariate linear regression model for ladies
fit_to_l<- bsb_to %>% filter(category=="l") %>% 
  lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_to_l)
## 
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.225 -17.745  -2.402  16.589  83.630 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         138.3338    18.2505   7.580 5.89e-14 ***
## age                   0.7990     0.3791   2.107   0.0352 *  
## height               -0.5436     0.1114  -4.877 1.18e-06 ***
## career_length         2.5670     0.3231   7.945 3.68e-15 ***
## num_quintile20~40%   11.2466     2.0468   5.495 4.56e-08 ***
## num_quintile40~60%   13.0063     2.1694   5.995 2.52e-09 ***
## num_quintile60~80%   40.2693     1.9659  20.484  < 2e-16 ***
## num_quintile80~100%  56.6217     2.1371  26.495  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.92 on 1570 degrees of freedom
##   (91 observations deleted due to missingness)
## Multiple R-squared:  0.5035, Adjusted R-squared:  0.5013 
## F-statistic: 227.4 on 7 and 1570 DF,  p-value: < 2.2e-16
##Multivariate linear regression model for Men
fit_to_m<- bsb_to %>% filter(category=="m") %>% 
  lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_to_m)
## 
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -134.556  -21.993   -3.643   18.345  135.851 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         224.1133    26.9699   8.310 2.46e-16 ***
## age                   2.7098     0.4818   5.624 2.30e-08 ***
## height               -1.0082     0.1526  -6.608 5.72e-11 ***
## career_length         3.0343     0.4061   7.471 1.48e-13 ***
## num_quintile20~40%    6.6950     3.0849   2.170   0.0302 *  
## num_quintile40~60%   12.5680     3.1391   4.004 6.60e-05 ***
## num_quintile60~80%   42.8846     2.8393  15.104  < 2e-16 ***
## num_quintile80~100%  54.8501     3.1609  17.353  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.35 on 1259 degrees of freedom
##   (55 observations deleted due to missingness)
## Multiple R-squared:  0.5059, Adjusted R-squared:  0.5031 
## F-statistic: 184.1 on 7 and 1259 DF,  p-value: < 2.2e-16

In the final multivariate models for men and ladies, each predictor still has statistically significant relationship with season best total score after adjusting the rest of the biography variables. Nearly 50% of the variance of season best total score can be explained by age, height, career length and country rank.

##Comparing the point estimate of coefficients with 95% conficence interval between sex. 
fit_to<- bsb_to %>% group_by(category) %>% 
  do(tidy(lm(score~ age+height+career_length+num_quintile, data= .), data= .))

fit_to<- tbl_df(fit_to)

fit_to %>% filter(term=="age") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) + 
  ggtitle('Estimate based on Age')+geom_hline(yintercept = 0)

fit_to %>% filter(term=="height") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('Estimate based on Height')+geom_hline(yintercept = 0)

fit_to %>% filter(term=="career_length") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
  geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+
  coord_fixed(ratio=1.5) + ggtitle('Estimate based on Career Length')+geom_hline(yintercept = 0)

fit_to %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
                         "num_quintile60~80%","num_quintile80~100%")) %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) + ggtitle('Estimate based on Country Rank')+geom_hline(yintercept = 0)

We can see the magnitude of the associations of age and height with season best total score varies by sex.

Part 2: season best score of short program

bsb_sp<- bsb %>% filter(program_type=="sp") 

bsb_sp %>% ggplot(aes(age,score,color=category))+geom_point()+
  geom_smooth(method="lm")+facet_grid(.~category)+ 
  ggtitle('Association between Age and SP Score')

bsb_to %>% group_by(category) %>% do(tidy(lm(score~ age,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term  estimate std.error statistic      p.value
##      (chr)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
## 1        l (Intercept) 60.446374 4.7691940 12.674337 3.292914e-35
## 2        l         age  3.407357 0.2762010 12.336511 1.603907e-33
## 3        m (Intercept) 42.016774 6.1057150  6.881549 9.128167e-12
## 4        m         age  6.491712 0.3195556 20.314815 4.893443e-80

For both ladies and men, age has a significantly positive relationship with season best score of short program.

bsb_sp %>% ggplot(aes(height,score,color=category))+
  geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
  ggtitle('Association between Height and SP score')

bsb_sp %>% group_by(category) %>% do(tidy(lm(score~ height,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term    estimate   std.error statistic      p.value
##      (chr)       (chr)       (dbl)       (dbl)     (dbl)        (dbl)
## 1        l (Intercept) 100.7323231 10.38638524  9.698497 2.223014e-21
## 2        l      height  -0.3536300  0.06415242 -5.512341 4.430405e-08
## 3        m (Intercept) 112.8138506 14.89958200  7.571612 9.135357e-14
## 4        m      height  -0.3196054  0.08612681 -3.710871 2.192384e-04

For both ladies and men, height has a significantly negative relationship with season best score of short program.

bsb_sp %>% ggplot(aes(career_length,score,color=category))+
  geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
  ggtitle('Association between Career Length and SP Score')

bsb_sp %>% group_by(category) %>% do(tidy(lm(score~ career_length,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category          term  estimate std.error statistic      p.value
##      (chr)         (chr)     (dbl)     (dbl)     (dbl)        (dbl)
## 1        l   (Intercept) 28.412361 1.2620557  22.51276 4.228408e-92
## 2        l career_length  1.286085 0.1024824  12.54933 8.685723e-34
## 3        m   (Intercept) 32.760015 1.5197246  21.55655 2.336999e-83
## 4        m career_length  1.887580 0.1097136  17.20461 1.545787e-57

For both ladies and men, career length has a significantly positive relationship with season best score of short program.

bsb_sp %>% ggplot(aes(num_quintile,score,color=category))+
  geom_boxplot()+facet_grid(category~.) + ggtitle('Association between Country Rank and SP Score')

bsb_sp %>% group_by(category) %>% 
  do(tidy(lm(score~ num_quintile,data= .),data= .)) 
## Source: local data frame [10 x 6]
## Groups: category [2]
## 
##    category                term  estimate std.error statistic
##       (chr)               (chr)     (dbl)     (dbl)     (dbl)
## 1         l         (Intercept) 33.123612 0.6447213 51.376633
## 2         l  num_quintile20~40%  4.701086 0.9068478  5.183986
## 3         l  num_quintile40~60%  6.270518 0.9635717  6.507578
## 4         l  num_quintile60~80% 15.303851 0.8732485 17.525195
## 5         l num_quintile80~100% 22.731975 0.9266348 24.531752
## 6         m         (Intercept) 46.767396 1.0066236 46.459666
## 7         m  num_quintile20~40%  3.993733 1.4350153  2.783060
## 8         m  num_quintile40~60%  5.317126 1.5008248  3.542803
## 9         m  num_quintile60~80% 19.187754 1.3595138 14.113689
## 10        m num_quintile80~100% 21.754738 1.4830991 14.668432
## Variables not shown: p.value (dbl)

Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater season best score of short program.

##Multivariate linear regression model for ladies
fit_sp_l<- bsb_sp %>% filter(category=="l") %>% 
  lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_sp_l)
## 
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9226  -6.3346  -0.3698   6.1869  26.1133 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         63.96516    7.83919   8.160 9.53e-16 ***
## age                  0.19328    0.16087   1.201     0.23    
## height              -0.27428    0.04812  -5.700 1.55e-08 ***
## career_length        1.01311    0.13975   7.249 8.11e-13 ***
## num_quintile20~40%   3.98783    0.88314   4.515 7.03e-06 ***
## num_quintile40~60%   4.30449    0.93948   4.582 5.16e-06 ***
## num_quintile60~80%  14.08897    0.84310  16.711  < 2e-16 ***
## num_quintile80~100% 20.30167    0.92630  21.917  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.883 on 1051 degrees of freedom
##   (69 observations deleted due to missingness)
## Multiple R-squared:  0.5088, Adjusted R-squared:  0.5055 
## F-statistic: 155.5 on 7 and 1051 DF,  p-value: < 2.2e-16
##Multivariate linear regression model for Men
fit_sp_m<- bsb_sp %>% filter(category=="m") %>% 
  lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_sp_m)
## 
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.764  -8.130  -0.810   6.791  41.601 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         82.16953   11.25543   7.300 6.40e-13 ***
## age                  0.99115    0.19653   5.043 5.56e-07 ***
## height              -0.37479    0.06457  -5.804 9.02e-09 ***
## career_length        0.99938    0.16907   5.911 4.85e-09 ***
## num_quintile20~40%   2.63938    1.23667   2.134 0.033095 *  
## num_quintile40~60%   4.30340    1.30271   3.303 0.000994 ***
## num_quintile60~80%  15.46956    1.16964  13.226  < 2e-16 ***
## num_quintile80~100% 17.98100    1.28734  13.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.68 on 883 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.4937, Adjusted R-squared:  0.4897 
## F-statistic:   123 on 7 and 883 DF,  p-value: < 2.2e-16

In the final multivariate models for men and ladies,nearly all the predictors still have statistically significant relationship with season best score of short program after adjusting the rest of the biography variables, except for the association of age with the outcome among ladies. Around 50% of the variance of season best score of short program can be explained by age, height, career length and country rank.

##Comparing the point estimate of coefficients with 95% conficence interval between sex.
fit_sp<- bsb_sp %>% group_by(category) %>% 
  do(tidy(lm(score~ age+height+career_length+num_quintile,data= .),data=.))

fit_sp<- tbl_df(fit_sp)

fit_sp %>% filter(term=="age") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('SP Estimate based on Age')+geom_hline(yintercept = 0)

fit_sp %>% filter(term=="height") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('SP Estimate based on Height')+geom_hline(yintercept = 0)

fit_sp %>% filter(term=="career_length") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
  geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+
  coord_fixed(ratio=1.5) + ggtitle('SP Estimate based on Career Length')+geom_hline(yintercept = 0)

fit_sp %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
                         "num_quintile60~80%","num_quintile80~100%")) %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) +
  ggtitle('SP Estimate based on Country Rank')+geom_hline(yintercept = 0)

We can see the magnitude of the associations of age with season best score of short program varies by sex.

Part 3: season best score of free skating

bsb_fs<- bsb %>% filter(program_type=="fs") 

bsb_fs %>% ggplot(aes(age,score,color=category))+
  geom_point()+geom_smooth(method="lm")+facet_grid(.~category)+
  ggtitle('Association between Age and FS Score')

bsb_fs %>% group_by(category) %>% do(tidy(lm(score~ age,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term  estimate std.error statistic      p.value
##      (chr)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
## 1        l (Intercept) 46.013022 4.2370918 10.859576 3.719303e-26
## 2        l         age  1.941041 0.2459612  7.891654 7.228734e-15
## 3        m (Intercept) 30.935479 5.0415615  6.136091 1.273917e-09
## 4        m         age  4.254407 0.2640957 16.109342 2.115192e-51

For both ladies and men, age has a significantly positive relationship with season best score of free skating.

bsb_fs %>% ggplot(aes(height,score,color=category))+
  geom_point()+geom_smooth(method="lm")+facet_grid(.~category)+
  ggtitle('Association between Height and FS Score')

bsb_fs %>% group_by(category) %>% do(tidy(lm(score~ height,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term    estimate  std.error statistic      p.value
##      (chr)       (chr)       (dbl)      (dbl)     (dbl)        (dbl)
## 1        l (Intercept) 192.5054564 20.6228505  9.334571 5.919500e-20
## 2        l      height  -0.6972585  0.1273909 -5.473379 5.529755e-08
## 3        m (Intercept) 223.9802303 30.5294436  7.336532 5.080468e-13
## 4        m      height  -0.6503897  0.1765796 -3.683265 2.446831e-04

For both ladies and men, height has a significantly negative relationship with season best score of free skating.

bsb_fs %>% ggplot(aes(career_length,score,color=category))+
  geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
  ggtitle('Association between Career Length and FS Score')

bsb_fs %>% group_by(category) %>% do(tidy(lm(score~ career_length,data= .),data= .))
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category          term  estimate std.error statistic      p.value
##      (chr)         (chr)     (dbl)     (dbl)     (dbl)        (dbl)
## 1        l   (Intercept) 50.844946 2.4950782  20.37810 8.000505e-78
## 2        l career_length  2.469671 0.2037479  12.12121 1.047491e-31
## 3        m   (Intercept) 62.722539 3.0637166  20.47270 3.957547e-76
## 4        m career_length  3.760502 0.2230866  16.85669 2.851872e-55

For both ladies and men, career length has a significantly positive relationship with season best score of free skating.

bsb_fs %>% ggplot(aes(num_quintile,score,color=category))+
  geom_boxplot()+facet_grid(category~.) +
  ggtitle('Association between Country Rank and FS Score')

bsb_fs %>% group_by(category) %>% 
  do(tidy(lm(score~ num_quintile,data= .),data= .)) 
## Source: local data frame [10 x 6]
## Groups: category [2]
## 
##    category                term  estimate std.error statistic
##       (chr)               (chr)     (dbl)     (dbl)     (dbl)
## 1         l         (Intercept) 58.615674  1.285548 45.595857
## 2         l  num_quintile20~40% 10.145326  1.807680  5.612344
## 3         l  num_quintile40~60% 12.380370  1.898663  6.520572
## 4         l  num_quintile60~80% 30.892129  1.731626 17.839951
## 5         l num_quintile80~100% 44.780335  1.822302 24.573492
## 6         m         (Intercept) 89.676889  2.066315 43.399420
## 7         m  num_quintile20~40%  7.285817  2.964873  2.457379
## 8         m  num_quintile40~60% 10.355016  3.081852  3.359998
## 9         m  num_quintile60~80% 37.971633  2.758828 13.763683
## 10        m num_quintile80~100% 43.900780  2.997436 14.646113
## Variables not shown: p.value (dbl)

Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater season best score of short program.

##Multivariate linear regression model for ladies
fit_fs_l<- bsb_fs %>% filter(category=="l") %>% 
  lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_fs_l)
## 
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.386 -12.310  -0.954  11.644  58.737 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         116.57262   15.55604   7.494 1.45e-13 ***
## age                   0.43612    0.32042   1.361    0.174    
## height               -0.51769    0.09557  -5.417 7.56e-08 ***
## career_length         1.83409    0.27792   6.599 6.64e-11 ***
## num_quintile20~40%    9.18052    1.77569   5.170 2.82e-07 ***
## num_quintile40~60%    8.68480    1.86914   4.646 3.82e-06 ***
## num_quintile60~80%   28.74391    1.69065  17.002  < 2e-16 ***
## num_quintile80~100%  40.33783    1.84811  21.826  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.39 on 1017 degrees of freedom
##   (69 observations deleted due to missingness)
## Multiple R-squared:  0.5115, Adjusted R-squared:  0.5081 
## F-statistic: 152.1 on 7 and 1017 DF,  p-value: < 2.2e-16
##Multivariate linear regression model for men
fit_fs_m<- bsb_fs %>% filter(category=="m") %>% 
  lm(score~ age+height+career_length+num_quintile, data= .)
summary(fit_fs_m)
## 
## Call:
## lm(formula = score ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.181 -15.812  -3.602  13.092  94.988 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         162.2113    23.0838   7.027 4.35e-12 ***
## age                   2.0258     0.4052   5.000 6.97e-07 ***
## height               -0.7537     0.1324  -5.694 1.72e-08 ***
## career_length         1.8892     0.3478   5.432 7.31e-08 ***
## num_quintile20~40%    5.0810     2.5744   1.974  0.04875 *  
## num_quintile40~60%    8.7599     2.6930   3.253  0.00119 ** 
## num_quintile60~80%   30.4064     2.3947  12.697  < 2e-16 ***
## num_quintile80~100%  36.1535     2.6269  13.763  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.38 on 843 degrees of freedom
##   (39 observations deleted due to missingness)
## Multiple R-squared:  0.4926, Adjusted R-squared:  0.4884 
## F-statistic: 116.9 on 7 and 843 DF,  p-value: < 2.2e-16

In the final multivariate models for men and ladies,nearly all the predictors still have statistically significant relationship with season best score of free skating after adjusting the rest of the biography variables, except for the association of age with the outcome among ladies. Around 50% of the variance of season best score of free skating can be explained by age, height, career length and country rank.

##Comparing the point estimate of coefficients with 95% conficence interval between sex.
fit_fs<- bsb_fs %>% group_by(category) %>% 
  do(tidy(lm(score~ age+height+career_length+num_quintile,data= .),data=.))

fit_fs<- tbl_df(fit_fs)

fit_fs %>% filter(term=="age") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('FS Estimated based on Age')+geom_hline(yintercept = 0)

fit_fs %>% filter(term=="height") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('FS Estimated based on Height')+geom_hline(yintercept = 0)

fit_fs %>% filter(term=="career_length") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
  geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+
  facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('FS Estimated based on Career Length')+geom_hline(yintercept = 0)

fit_fs %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
                         "num_quintile60~80%","num_quintile80~100%")) %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) +
   ggtitle('FS Estimated based on Country Rank')+geom_hline(yintercept = 0)

We can see the magnitude of the associations of age with season best score of free skating varies by sex.

Linear Regression for Single Event Scores

seb<- single_event_bio
seb %>% filter(is.na(name)==TRUE) %>% summarise(n())
## Source: local data frame [1 x 1]
## 
##     n()
##   (int)
## 1   142

outof1392singles, there are 142 missing biography info.

#Remove observations with missing biography
seb<- seb %>% filter(is.na(name)==FALSE) 

In this part, we ran linear regression to explore the associations between biography variables and individual event total scores for male and female skaters.

seb %>% ggplot(aes(age,total,color=category))+geom_point()+
  geom_smooth(method="lm")+facet_grid(.~category) +
  ggtitle('Association between Age and Event Total Score')

seb %>% group_by(category) %>% do(tidy(lm(total~ age,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term    estimate  std.error   statistic      p.value
##      (chr)       (chr)       (dbl)      (dbl)       (dbl)        (dbl)
## 1   Ladies (Intercept) 142.8282261  7.9832861 17.89090671 1.853314e-57
## 2   Ladies         age   0.0125651  0.3709298  0.03387461 9.729887e-01
## 3      Men (Intercept) 156.7544514 11.3893093 13.76329738 4.922588e-38
## 4      Men         age   1.4914508  0.4893586  3.04776627 2.398448e-03

Age has a significantly positive relationship with event total score for men only.

seb %>% ggplot(aes(height,total,color=category))+
  geom_point()+geom_smooth(method="lm")+facet_grid(.~category) +
  ggtitle('Association between Height and Event Total Score')

seb %>% group_by(category) %>% do(tidy(lm(total~ height,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category        term    estimate  std.error statistic      p.value
##      (chr)       (chr)       (dbl)      (dbl)     (dbl)        (dbl)
## 1   Ladies (Intercept)  74.1585103 42.3181108  1.752406 8.023151e-02
## 2   Ladies      height   0.4254202  0.2607342  1.631624 1.033006e-01
## 3      Men (Intercept) 357.9579385 45.6126343  7.847780 1.740790e-14
## 4      Men      height  -0.9596096  0.2625566 -3.654868 2.780228e-04

Height has a significantly negative relationship with event total score for men.

seb %>% ggplot(aes(career_length,total,color=category))+
  geom_point()+geom_smooth(method="lm")+
  facet_grid(.~category) +
  ggtitle('Association between Career Length and Event Total Score')

seb %>% group_by(category) %>% do(tidy(lm(total~ career_length,data= .),data= .)) 
## Source: local data frame [4 x 6]
## Groups: category [2]
## 
##   category          term   estimate std.error statistic      p.value
##      (chr)         (chr)      (dbl)     (dbl)     (dbl)        (dbl)
## 1   Ladies   (Intercept) 131.675512 5.2065579 25.290319 1.400202e-95
## 2   Ladies career_length   0.765271 0.3186156  2.401863 1.662641e-02
## 3      Men   (Intercept) 145.342432 6.5874661 22.063481 6.849744e-81
## 4      Men career_length   2.717931 0.3776833  7.196323 1.716764e-12

For both ladies and men, career length has a significantly positive relationship with event total score.

seb %>% ggplot(aes(num_quintile,total,color=category))+
  geom_boxplot()+facet_grid(category~.) +
  ggtitle('Association between Country Rank and Event Total Score')

seb %>% group_by(category) %>% 
  do(tidy(lm(total~ num_quintile,data= .),data= .)) 
## Source: local data frame [10 x 6]
## Groups: category [2]
## 
##    category                term   estimate std.error statistic
##       (chr)               (chr)      (dbl)     (dbl)     (dbl)
## 1    Ladies         (Intercept) 116.585946  2.966308 39.303382
## 2    Ladies  num_quintile20~40%  14.371673  4.068227  3.532663
## 3    Ladies  num_quintile40~60%   6.751807  4.014349  1.681918
## 4    Ladies  num_quintile60~80%  35.705840  3.481530 10.255790
## 5    Ladies num_quintile80~100%  46.626657  3.641256 12.805103
## 6       Men         (Intercept) 166.613208  3.367635 49.474836
## 7       Men  num_quintile20~40%   6.956973  4.708618  1.477498
## 8       Men  num_quintile40~60%  10.805792  4.719061  2.289818
## 9       Men  num_quintile60~80%  42.314174  4.062066 10.416910
## 10      Men num_quintile80~100%  44.565022  4.885006  9.122818
## Variables not shown: p.value (dbl)

Compared to skaters from country in the lowest quintile of number of skaters, skaters from country in higher quintile have greater event total score.

##Multivariate linear regression model for ladies
fit_l<- seb %>% filter(category=="Ladies") %>% 
  lm(total~ age+height+career_length+num_quintile, data= .)
summary(fit_l)
## 
## Call:
## lm(formula = total ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72.35 -17.82   1.19  17.35  62.17 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          46.0806    34.9298   1.319  0.18762    
## age                  -1.4905     0.5829  -2.557  0.01082 *  
## height                0.4129     0.2123   1.945  0.05229 .  
## career_length         2.3060     0.4961   4.649 4.16e-06 ***
## num_quintile20~40%   12.6395     3.9253   3.220  0.00136 ** 
## num_quintile40~60%    4.9267     3.9492   1.248  0.21273    
## num_quintile60~80%   36.6077     3.3582  10.901  < 2e-16 ***
## num_quintile80~100%  45.9865     3.6088  12.743  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.29 on 566 degrees of freedom
##   (18 observations deleted due to missingness)
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.361 
## F-statistic: 47.25 on 7 and 566 DF,  p-value: < 2.2e-16
##Multivariate linear regression model for men
fit_m<- seb %>% filter(category=="Men") %>% 
  lm(total~ age+height+career_length+num_quintile, data= .)
summary(fit_m)
## 
## Call:
## lm(formula = total ~ age + height + career_length + num_quintile, 
##     data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.989 -22.242  -1.045  22.180 117.290 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         284.9161    42.9873   6.628 7.23e-11 ***
## age                  -2.0043     0.6646  -3.016  0.00267 ** 
## height               -0.7091     0.2384  -2.974  0.00305 ** 
## career_length         3.1544     0.5307   5.944 4.56e-09 ***
## num_quintile20~40%    5.9951     4.5885   1.307  0.19184    
## num_quintile40~60%   11.3235     4.6101   2.456  0.01430 *  
## num_quintile60~80%   36.9621     4.0160   9.204  < 2e-16 ***
## num_quintile80~100%  42.4191     4.7499   8.930  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.4 on 641 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.2838, Adjusted R-squared:  0.276 
## F-statistic: 36.28 on 7 and 641 DF,  p-value: < 2.2e-16

In the final multivariate model for ladies, the association between age and event total score became statistically significant after adjusting the rest of the biography variables. These biography variables account for 36% of the variance of event total score, which is less than the 50% for explaining season best scores.

In the final multivariate model for men, the associations of biography variables with event total score changed slightly after adjusting the rest of the biography variables, except for age became negatively associated with outcome. However, these biography variables can account only 28% of the variance of event total score for men, which is 8% less than the proportion explained by biography variables of ladies and is 20% less than the proportion for explaining season best scores among men.

##Comparing the point estimate of coefficients with 95% conficence interval between sex.
fit<- seb %>% group_by(category) %>% 
  do(tidy(lm(total~ age+height+career_length+num_quintile,data= .),data=.))

fit<- tbl_df(fit)

fit %>% filter(term=="age") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('Estimate of Event Total Score by Age')+geom_hline(yintercept = 0)

##After adjusting for height,career_length and num_quintile, the association of age with total score became negative. 

fit %>% filter(term=="height") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+coord_fixed(ratio=1.5) +
  ggtitle('Estimate of Event Total Score by Height')+geom_hline(yintercept = 0)

##After adjusting for height,career_length and num_quintile, the association of height with total score isn't statistically significant.

fit %>% filter(term=="career_length") %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+
  geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(.~category)+
  coord_fixed(ratio=1.5) +
  ggtitle('Estimate of Event Total Score by Career Length')+geom_hline(yintercept = 0)

##After adjusting for height,career_length and num_quintile, the positive association of career length with total score is statistically significant.

fit %>% filter(term%in%c("num_quintile20~40%","num_quintile40~60%",
                         "num_quintile60~80%","num_quintile80~100%")) %>% 
  mutate(ci_lower=estimate-1.96*std.error,ci_upper=estimate+1.96*std.error) %>%
  ggplot(aes(x=term,y=estimate,color=category))+geom_point()+geom_errorbar(aes(ymax=ci_upper,ymin=ci_lower))+ facet_grid(category~.) + 
  ggtitle('Estimate of Event Total Score by Country Rank')+geom_hline(yintercept = 0)

We can see the opposite direction of the associations of height with event total score comparing ladies to men. Taller skaters tend to score higher among ladies, however, shorter skaters are more likely to score higher among men, holding other variables constant. It is also interesting the note that holding career length constant, younger skaters tend to score higher for both ladies and men.

Machine Learning

We predicted Season Best Scores (SBS) from different machine learning methods.

Instead of predicting exact SBS, we would like to keep it simple by separating SBS into quartiles based on total score from 2015. Q1 is the highest scores while Q4 is the lowest. Our goal was be to predict 2015 quartile of Season Best Total Score correctly by using past scores and biography data.

Due to sample size, we focused only on single men and single ladies. We looked at the predicitive models for men and ladies separately because their SBS distributions were very different, as discovered during exploratory analysis.

The machine learning methods we examined were kmean, knn and LDA. For KNN and LDA, we trained with 80% of the data set and used 20% of the data as the test set. We evaluated the accuracy of the method by calculating percentage of data with predicted quartile, same as actual quartile for 2015 SBS. We applied 100 replications for each method.

library(cowplot)
library(stringr)
library(caret)
library(dendextend)

First we reformated the table.

cleaned.df <- seasonbestsingle_bio %>% 
  mutate(UID = paste(name,category, sep="_")) %>%
  mutate(UID = paste(UID,program_type, sep="_")) %>%
  filter(!is.na(name))

scorespread.df <-cleaned.df  %>%
  filter(program_type == "to") %>%
  select(UID, score, season_start_year) %>%
  mutate(season_start_year = paste0("yr_",as.character(season_start_year))) %>%
  #select(name, category, score, season_start_year) %>%
  spread(season_start_year, score) %>%
  filter(!is.na(yr_2015))

info.df <- cleaned.df %>%
  filter(season_start_year == 2015) %>%
  mutate(countryQuintile_by_skaterNum = num_quintile) %>%
  select(UID, first_name, last_name, age, career_length,countryQuintile_by_skaterNum, height, country, category, name) %>%
  distinct()

labels<- c("4th Q","3rd Q","2nd Q", "1st Q")

scorespread.m.df <-
  scorespread.df %>%
  filter(str_sub(UID, start= -4) == "m_to")   %>%
  mutate(SBS_2015_group =cut(yr_2015,
                          breaks=quantile(yr_2015,prob=seq(0,1,0.25)),
                          include.lowest=TRUE,
                          labels=labels))

scorespread.l.df <-
  scorespread.df %>%
  filter(str_sub(UID, start= -4) == "l_to")   %>%
  mutate(SBS_2015_group =cut(yr_2015,
                          breaks=quantile(yr_2015,prob=seq(0,1,0.25)),
                          include.lowest=TRUE,
                          labels=labels))

scorespread.m.df <- scorespread.m.df %>%
  left_join(info.df, by ="UID") 
scorespread.l.df <- scorespread.l.df %>%
  left_join(info.df, by ="UID") 


scorespread.df<-bind_rows(scorespread.m.df,scorespread.l.df)


rm(cleaned.df)

We then created a matrix with data to use for different machine learning methods.

as.numeric.factor <- function(x) {seq_along(levels(x))[x]}

formatMatrix <- function(df){
  X <-
    df %>%
    select(-first_name,-last_name, -name, - SBS_2015_group, -yr_2015, -country, -UID) %>%
    mutate(category = as.factor(category))
  X[is.na(X)] <- 0
  rownames(X) <-df$UID
  return (X)
}
data.m.df <-formatMatrix(scorespread.m.df)
data.l.df <-formatMatrix(scorespread.l.df)
#X <- X %>%
#mutate(category = as.numeric.factor(category)) %>%
#mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))  

Hierarchical clustering

#hierarchical clustering 

## function to set label color
labelCol <- function(x) {
  if (is.leaf(x)) {
    ## fetch label
    label <- attr(x, "label")
    code <- substr(label, 1, 1)
    ## use the following line to reset the label to one letter code
    # attr(x, "label") <- code
    attr(x, "nodePar") <- list(lab.col=colorCodes[code])
  }
  return(x)
}

colorCodes <- c(A="red", B="green", C="blue", D="yellow")

for( i in 1:2){
  if (i == 1) {
    data <-data.m.df 
    rownames(data)<-str_replace(rownames(data), "_m_to","")
    title <- "Hierarchical clustering  for men"
    namesource.df <-scorespread.m.df
  }
  else {
    data <-data.l.df
    rownames(data)<-str_replace(rownames(data), "_l_to","")
    title <- "Hierarchical clustering  for ladies"
    namesource.df <-scorespread.l.df
  }

  data <- dist(data, method = c("euclidian"))
  hcdata <- hclust(data, method = c("average"))
  ## apply labelCol on all nodes of the dendrogram
  dend <- as.dendrogram(hcdata)
  name.df <- data.frame(labels(dend))
  colnames(name.df)[1] <-"name"
  code <- name.df %>%
    left_join(namesource.df, by = "name") %>%
    select(SBS_2015_group)
  code <- code$SBS_2015_group
  
  code <- str_replace(code, "1st Q","red")
  code <- str_replace(code, "2nd Q","green")
  code <- str_replace(code, "3rd Q","blue")
  code <- str_replace(code, "4th Q","black")
  
  dend %>% 
    set("labels_cex", 0.4) %>%
    set("labels_colors", code) %>% 
    plot(main = title)

}

Due to the number of samples, it is very hard to read the label, so we colored the label by SBS 2015 quartile . Each label is a name of a skater. Q1 skaters are in red, Q2 skaters are in green, Q3 skaters are in blue, and Q4 skaters are in black.

From hierarchical clustering plot, for men we can see that Q1 skaters are grouped together at the far left corner and far right corner. Q4 skaters are in the middle. The separation between Q2 and Q3 are less obvious.

For ladies, Q1 skaters are grouped together at the far left corner and far right corner. The separation between Q2, Q3, and Q4 are less obvious.

Overall, hierarchical clustering identified the top performers of 2015 well, but did not identify other quartiles as clearly.

PCA

#men
data <- data.m.df %>%
mutate(category = as.numeric.factor(category)) %>%
mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))  

pc.cr.m = prcomp(data)
summary(pc.cr.m)
## Importance of components:
##                             PC1     PC2      PC3      PC4      PC5
## Standard deviation     177.1683 82.7412 60.97076 47.41208 36.10341
## Proportion of Variance   0.6441  0.1405  0.07628  0.04613  0.02675
## Cumulative Proportion    0.6441  0.7846  0.86087  0.90700  0.93375
##                             PC6      PC7      PC8     PC9   PC10    PC11
## Standard deviation     35.46903 33.04967 29.22991 4.18670 2.1605 1.30981
## Proportion of Variance  0.02582  0.02241  0.01753 0.00036 0.0001 0.00004
## Cumulative Proportion   0.95956  0.98198  0.99951 0.99987 1.0000 1.00000
##                        PC12
## Standard deviation        0
## Proportion of Variance    0
## Cumulative Proportion     1
#ladies
data <- data.l.df %>%
mutate(category = as.numeric.factor(category)) %>%
mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))  

pc.cr.l = prcomp(data)
summary(pc.cr.l)
## Importance of components:
##                             PC1     PC2      PC3      PC4      PC5
## Standard deviation     111.8550 59.4855 41.51743 36.62834 34.04632
## Proportion of Variance   0.5787  0.1637  0.07973  0.06206  0.05362
## Cumulative Proportion    0.5787  0.7424  0.82215  0.88421  0.93782
##                             PC6     PC7      PC8    PC9    PC10    PC11
## Standard deviation     26.51272 22.3965 10.88794 3.8908 2.05185 1.31120
## Proportion of Variance  0.03251  0.0232  0.00548 0.0007 0.00019 0.00008
## Cumulative Proportion   0.97034  0.9935  0.99903 0.9997 0.99992 1.00000
##                        PC12
## Standard deviation        0
## Proportion of Variance    0
## Cumulative Proportion     1

From PCA summary, for men, a total of 78.46% of the variation in the data is captured in the first two principle components. If we use the first 4 PCA components, we can capture 90.70% of variation in the data. For ladies, a total of 74.24% of the variation in the data is captured in the first two principle components. If we use the first 4 PCA components, we can capture 88.42% of the variation in the data.

k-mean clustering

We created a helper function below to calculate accuracy for k-mean clustering.

calculate.confusion <- function(states, clusters)
  {
    # generate a confusion matrix of cols C versus states S
    d <- data.frame(state = states, cluster = clusters)
    td <- as.data.frame(table(d))
    # convert from raw counts to percentage of each label
    pc <- matrix(ncol=max(clusters),nrow=0) # k cols
    for (i in 1:4) # 4 labels
    {
        #i <-1
      total <- sum(td[td$state==td$state[i],3])
      pc <- rbind(pc, td[td$state==td$state[i],3]/total)
    }
    rownames(pc) <- td[1:4,1]
    return(pc)
  }
  assign.cluster.labels <- function(cm, k)
  {
    # take the cluster label from the highest percentage in that column
    cluster.labels <- list()
    for (i in 1:k)
    {
      cluster.labels <- rbind(cluster.labels, row.names(cm)[match(max(cm[,i]), cm[,i])])
    }
  

    return(cluster.labels)
  }
  calculate.accuracy <- function(states, clabels)
  {
    matching <- Map(function(state, labels) { state %in% labels }, states, clabels)
    tf <- unlist(matching, use.names=FALSE)
    return (sum(tf)/length(tf))
  }
  
  
  getKMeanAccuracy<- function(data, y, k){
    
    cl <- kmeans(data, centers=k)
    conf.mat <- calculate.confusion(y, cl$cluster)
    cluster.labels <- assign.cluster.labels(conf.mat,k)
    acc <- calculate.accuracy(y, cluster.labels[cl$cluster])
    return (acc)  
  }
accuracy.kmean.df <- 
  data.frame(
            
            category = character(),
            k = integer(),
            PCA = integer(),
            accuracy = double(), 
            sd = double(),
            method = character()
  )  
for (i in 1:2) {
  if (i == 1) {
    pc.cr <- pc.cr.m
    title.add <- "for men"
    SBS_2015_group <-scorespread.m.df$SBS_2015_group
    category <- "m"
  }
  else {
     pc.cr <- pc.cr.l
     title.add <- "for ladies"
     SBS_2015_group <-scorespread.l.df$SBS_2015_group
     category <- "l"
  }
  
  data <-pc.cr$x[,1:2]
  pca1 <-pc.cr$x[,1]
  pca2 <-pc.cr$x[,2]
  
  # find optimal number of cluster for k means
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:15) wss[i] <- sum(kmeans(data,
                                       centers=i)$withinss)
  plot(1:15, wss, type="b", xlab="Number of Clusters",
            ylab="Within groups sum of squares", main = paste("optimal k-mean cluster with first 2 PCA", title.add))
  
  data <-pc.cr$x[,1:4]
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:15) wss[i] <- sum(kmeans(data,
                                       centers=i)$withinss)
  plot(1:15, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares", main = paste("optimal k-mean cluster with first 4 PCA", title.add))
  
  pc.df<-data.frame(PCA1=pca1, PCA2=pca2, SBS_2015_group=factor(SBS_2015_group))
  print(ggplot(pc.df, aes(x=PCA1, y=PCA2,  color=SBS_2015_group)) + 
    geom_point() + ggtitle(paste("SBS of 2015 using first 2 PCA",title.add)))
  data <-pc.cr$x[,1:2]
  cl <- kmeans(data, centers=4)
  pc.df<-data.frame(PCA1=pca1, PCA2=pca2, Cluster=factor(cl$cluster))
  print(ggplot(pc.df, aes(x=PCA1, y=PCA2,  color=Cluster)) + 
    geom_point() + ggtitle(paste("kmean using first 2 PCA",title.add)))

  data <-pc.cr$x[,1:4]
  cl <- kmeans(data, centers=4)
  pc.df<-data.frame(PCA1=pca1, PCA2=pca2, Cluster=factor(cl$cluster))
  print(ggplot(pc.df, aes(x=PCA1, y=PCA2,  color=Cluster)) + 
    geom_point() + ggtitle(paste("kmean using first 4 PCA",title.add)))
  
   for (p in 2:6){
     data <-pc.cr$x[,1:p]
     for (k in 4:10){
       accuracy.kmean<-replicate(100,getKMeanAccuracy(data, SBS_2015_group, k))
      new.df <- data.frame(category = category, k =k , PCA = p,accuracy= mean(accuracy.kmean),sd= sd(accuracy.kmean),method = "kmean")
    accuracy.kmean.df <- bind_rows(accuracy.kmean.df, new.df)
     }
    #data <-pc.cr$x[,1:2]
    #accuracy.kmean.pca2<-replicate(100,getKMeanAccuracy(data, SBS_2015_group, k))
    #cat("For cluster =", k, ", kmean with using first 2 PCA acheive accuracy with mean = ", mean(accuracy.kmean.pca2), ", sd =",sd(accuracy.kmean.pca2), " ",title.add, "\n")
    #data <-pc.cr$x[,1:4]
    #accuracy.kmean.pca4<-replicate(100,getKMeanAccuracy(data, SBS_2015_group, k))
    #cat("For cluster =", k, ", kmean with using first 4 PCA acheive accuracy with mean = ", mean(accuracy.kmean.pca4), ", sd =",sd(accuracy.kmean.pca4), " ",title.add, "\n")
  }
}

accuracy.kmean.df%>%
  ggplot(aes(x=k,y=accuracy))+
  geom_boxplot(aes( group = k)) +
  xlab("k")+
  ggtitle("Kmean accuray")+ 
  facet_wrap(~ category, scale = "free")

accuracy.kmean.df%>%
  ggplot(aes(x=PCA,y=accuracy))+
  geom_boxplot(aes( group = PCA)) +
  xlab("PCA")+
  ggtitle("Kmean accuray")+ 
  facet_wrap(~ category, scale = "free")

do.call(rbind,lapply(split(accuracy.kmean.df,accuracy.kmean.df$category),function(chunk) chunk[which.max(chunk$accuracy),]))  
## Source: local data frame [2 x 6]
## 
##   category     k   PCA  accuracy         sd method
##      (chr) (int) (int)     (dbl)      (dbl)  (chr)
## 1        l    10     2 0.5021176 0.01427521  kmean
## 2        m    10     3 0.4596682 0.01710113  kmean

The table and graph above show the optimal k and PCA to use for knn. Around 50% was the optimal accuracy achieved.

For men, if we used the first 2 PCA or first 4 PCA, the optimal cluster is 5 or 6. For ladies, if we used the first 2 PCA or first 4 PCA, the optimal cluster is 6. Since we divided SPB 2015 into quartiles, we used 4 as our cluster number, which is also close to the optimal k-mean cluster.

Overall, kmean has higher accurarcy for ladies than for men. Increasing k helps to improve accuracy. But this is expected because we created more clusters and each cluster had a label of one of the quartiles. Increasing PCA does not always improve accuracy.

KNN

library(class)


getAccuracyForKNN <-function(X, k){
  inTrain <- createDataPartition(y = X$SBS_2015_group, p=0.8)$Resample1
  train <- slice(X, inTrain)
  test <- slice(X, -inTrain)
  train.class <- as.factor(train$SBS_2015_group)
  test.class <- as.factor(test$SBS_2015_group)
  train <- train %>% select(-SBS_2015_group)
  test <- test %>% select(-SBS_2015_group)
  #train <- train[, -ncol(train)]
  #test <- test[, -ncol(test)]
  test<-test%>%filter(complete.cases(.))
  y_hat_knn <- knn(train = train,test =  test, cl = train.class, k = k, prob=FALSE)
  acc <- mean(test.class==y_hat_knn)

  return(acc)
  
}  
accuracy.knn.df <- 
  data.frame(
            
            category = character(),
            parameter = integer(),
            accuracy = double(), 
            sd = double(),
            method = character()
  )  
  
for ( i in 1:2){
  if (i == 1) {
    SBS_2015_group<- scorespread.m.df$SBS_2015_group
    
    X <- data.m.df
    
    title.add <- "for men"
    category <- "m"
  }else {
    SBS_2015_group<- scorespread.l.df$SBS_2015_group
     
      X <- data.l.df
     title.add <- "for ladies"
   category <- "l"
  }
  X <- X %>%
    mutate(category = as.numeric.factor(category)) %>%
    mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))
  X <- bind_cols(X,data.frame(SBS_2015_group))
  

  for (k in 1:10){
    accuracy.knn<-replicate(100,getAccuracyForKNN(X, k))
    #  cat("For  neighbor =", k, ", knn acheives accuracy with mean = ", mean(accuracy.knn), ", sd =",sd(accuracy.knn), " ",title.add, "\n")
    new.df <- data.frame(category = category, parameter =k ,accuracy= mean(accuracy.knn),sd= sd(accuracy.knn),method = "knn")
    accuracy.knn.df <- bind_rows(accuracy.knn.df, new.df)
  }
}
accuracy.knn.df%>%
  ggplot(aes(parameter,accuracy))+
  geom_point(aes(color=category)) +
  xlab("number of neighbor") +
  ggtitle("KNN accuray")

do.call(rbind,lapply(split(accuracy.knn.df,accuracy.knn.df$category),function(chunk) chunk[which.max(chunk$accuracy),]))  
## Source: local data frame [2 x 5]
## 
##   category parameter  accuracy         sd method
##      (chr)     (int)     (dbl)      (dbl)  (chr)
## 1        l         6 0.5289583 0.06357072    knn
## 2        m         4 0.4667500 0.06779863    knn

The table and graph above show the optimal number of neighbor to use for knn. Knn has a higher accuracy to preduict SBS quartile for ladies then for men.

LDA

library(MASS)

getAccuracyForLDA <-function(X, k){
  inTrain <- createDataPartition(y = X$SBS_2015_group, p=0.8)$Resample1
  train <- slice(X, inTrain)
  test <- slice(X, -inTrain)
  train.class <- as.factor(train$SBS_2015_group)
  test.class <- as.factor(test$SBS_2015_group)
  train <- train %>% dplyr::select(-SBS_2015_group, -category)
  test <- test %>% dplyr::select(-SBS_2015_group, -category)
  #train <- train[, -ncol(train)]
  #test <- test[, -ncol(test)]
  
  test<-test%>%filter(complete.cases(.))
  
  data.lda <- lda(train.class~.,
                data=train)
  #Project data on linear discriminants
  data.lda.y_hat <- predict(data.lda, test)
  
  #Extract scaling for each predictor 
  #data2.lda <- data.frame(varnames=rownames(coef(data.lda)), coef(data.lda))

  acc <- mean(test.class==data.lda.y_hat$class)

  return(acc)
  
}  
accuracy.lda.df <- 
  data.frame(
            
            category = character(),
            #parameter = integer(),
            accuracy = double(), 
            sd = double(),
            method = character()
  )  
  
for ( i in 1:2){
  if (i == 1) {
    SBS_2015_group<- scorespread.m.df$SBS_2015_group
    
    X <- data.m.df
    
    title.add <- "for men"
    category <- "m"
  }else {
    SBS_2015_group<- scorespread.l.df$SBS_2015_group
     
      X <- data.l.df
     title.add <- "for ladies"
   category <- "l"
  }
  X <- X %>%
    mutate(category = as.numeric.factor(category)) %>%
    mutate(countryQuintile_by_skaterNum = as.numeric.factor(countryQuintile_by_skaterNum))
  X <- bind_cols(X,data.frame(SBS_2015_group))
  
  accuracy.lda<-replicate(100,getAccuracyForLDA(X, k))
      new.df <- data.frame(category = category ,accuracy= mean(accuracy.lda),sd= sd(accuracy.lda),method = "lda")
      accuracy.lda.df <- bind_rows(accuracy.lda.df, new.df)
}
do.call(rbind,lapply(split(accuracy.lda.df,accuracy.lda.df$category),function(chunk) chunk[which.max(chunk$accuracy),]))  
## Source: local data frame [2 x 4]
## 
##   category  accuracy         sd method
##      (chr)     (dbl)      (dbl)  (chr)
## 1        l 0.5320833 0.06078742    lda
## 2        m 0.4597500 0.06541389    lda

The table and graph above show the optimal number of neighbors to use for LDA. LDA achieves the highest accuracy for ladies after 100 replications from using 20% of data as test set while KNN achieves the highest accuracy for men. Both have similar accuracy standard deviation of 6%.

Each method has higher accuracy to predict 2015 SBS quartile for ladies than for men.

The reason why the accuracy is not high can be that not all skater have every SBS score since 2008, ex. some skaters only started since 2012.

In the future, we can try other machine learning method, like SVM.